Lesson 4


setwd('~/Downloads')
getwd()
## [1] "/Users/caicai/Downloads"

Scatterplots and Perceived Audience Size

Notes:


Scatterplots

Notes:

library(ggplot2)
pf <- read.csv('pseudo_facebook.tsv',sep='\t')

qplot(x=age,y=friend_count,data=pf)


What are some things that you notice right away?

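Response: Users with high friend counts tend to cluster in the younger age range, around age 20. The conspicuous vertical lines in the plot are probably ages that users entered arbitrarily, such as 69 and 100.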

ggplot Syntax

Notes:

ggplot(aes(x=age,y=friend_count),data=pf) +geom_point() +
  xlim(13,90)
## Warning: Removed 4906 rows containing missing values (geom_point).

summary(pf$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   20.00   28.00   37.28   50.00  113.00

Overplotting

Notes:

ggplot(aes(x=age,y=friend_count),data=pf) +
  geom_jitter(alpha=1/20) +
  xlim(13,90)
## Warning: Removed 5185 rows containing missing values (geom_point).

What do you notice in the plot?

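Response: The plot shows that friend counts for young users are not as high as they appeared before; most young users have fewer than 1,000 friends. A peak is still visible at age 69, although it is much fainter because alpha is set to 1/20, so it takes about 20 overlapping points to look like one solid circle. Even so, users aged 69 look comparable to users in the 25 and 26 age groups.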

Coord_trans()

Notes:

ggplot(aes(x=age,y=friend_count),data=pf) +
  geom_point(alpha=1/20,position = position_jitter(h=0)) +
  xlim(13,90) +
  coord_trans(y="sqrt")
## Warning: Removed 5181 rows containing missing values (geom_point).

Look up the documentation for coord_trans() and add a layer to the plot that transforms friend_count using the square root function. Create your plot!

What do you notice?
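As a rough illustration of what the square-root transform does to the y axis, and of why the jitter height is set to 0 in the code above, the base R sketch below applies sqrt() to a few made-up friend counts. The values are hypothetical and are not taken from pseudo_facebook.tsv.

```r
# Hypothetical friend counts spanning a wide range (not real data)
friend_count <- c(0, 100, 400, 900, 1600)

# On the square-root scale the growing gaps become equal steps,
# so large counts no longer dominate the axis
sqrt_scale <- sqrt(friend_count)

# Vertical jitter could push a count of 0 slightly below zero;
# sqrt() of a negative number is NaN, so that point could not be placed
jittered_zero <- -0.3
bad_value <- suppressWarnings(sqrt(jittered_zero))

print(sqrt_scale)
print(is.nan(bad_value))
```

Restricting the jitter to the horizontal direction with position_jitter(h=0) avoids creating such negative values.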


Alpha and Jitter

Notes:

ggplot(aes(x=age,y=friendships_initiated),data = pf) +
  geom_jitter(alpha=1/20,position = position_jitter(h=0)) +
  coord_trans(y='sqrt') +
  xlim(13,90)
## Warning: Removed 5177 rows containing missing values (geom_point).


Overplotting and Domain Knowledge

Notes:


Conditional Means

Notes:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
age_groups <- group_by(pf,age)
pf.fc_by_age <- summarise(age_groups,
                          friend_count_mean = mean(friend_count),
                          friend_count_median = median(friend_count),
                          n = n())

pf.fc_by_age <- arrange(pf.fc_by_age,age)
head(pf.fc_by_age)
## # A tibble: 6 x 4
##     age friend_count_mean friend_count_median     n
##   <int>             <dbl>               <dbl> <int>
## 1    13              165.                 74    484
## 2    14              251.                132   1925
## 3    15              348.                161   2618
## 4    16              352.                172.  3086
## 5    17              350.                156   3283
## 6    18              331.                162   5196
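The same grouping and summarising steps can also be chained with dplyr's pipe operator instead of storing the intermediate age_groups object. The sketch below runs on a tiny made-up data frame (toy and toy_by_age are hypothetical names, and the five rows are invented) so that the result can be checked by hand.

```r
library(dplyr)

# Toy stand-in for pf; these five rows are made up for illustration
toy <- data.frame(
  age          = c(13, 13, 14, 14, 14),
  friend_count = c(10, 30, 6, 15, 45)
)

# Group by age, compute the conditional mean, median, and group size,
# then sort by age, all in one chain
toy_by_age <- toy %>%
  group_by(age) %>%
  summarise(friend_count_mean   = mean(friend_count),
            friend_count_median = median(friend_count),
            n                   = n()) %>%
  arrange(age)

print(toy_by_age)
```

Applied to pf, the same chain should give the same table as pf.fc_by_age.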

Create your plot!

ggplot(aes(x=age,y=friend_count_mean),data=pf.fc_by_age) +
  geom_line()


Overlaying Summaries with Raw Data

Notes:

ggplot(aes(x=age,y=friend_count),data=pf) +
  coord_cartesian(xlim = c(13,70),ylim = c(0,1000)) +
  geom_point(alpha = 0.05,
             position = position_jitter(h=0),
             color = 'orange') +
  # In ggplot2 3.3.0 and later, fun.y is deprecated; use fun = mean / fun = quantile there
  geom_line(stat = 'summary',fun.y = mean) +
  geom_line(stat = 'summary',fun.y = quantile, fun.args = list(probs = .1),
            linetype = 2,color = 'blue') +
  geom_line(stat = 'summary',fun.y = quantile, fun.args = list(probs = .5),
            color = 'blue') +
  geom_line(stat = 'summary',fun.y = quantile, fun.args = list(probs = .9),
            linetype = 2,color = 'blue') 

What are some of your observations of the plot?

Response:


Moira: Histogram Summary and Scatterplot

See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.

Notes:


Correlation

Notes:

cor(pf$age,pf$friend_count)
## [1] -0.02740737
cor.test(pf$age,pf$friend_count,method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  pf$age and pf$friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03363072 -0.02118189
## sample estimates:
##         cor 
## -0.02740737
with(pf,cor.test(age,friend_count,method = 'pearson'))
## 
##  Pearson's product-moment correlation
## 
## data:  age and friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03363072 -0.02118189
## sample estimates:
##         cor 
## -0.02740737

Look up the documentation for the cor.test function.

What’s the correlation between age and friend count? Round to three decimal places.

Response: -0.027
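cor.test() returns a list-like object, so the coefficient can be pulled out and rounded in code rather than read off the printed output. A minimal sketch, using made-up vectors x and y in place of the pf columns:

```r
# Made-up vectors standing in for two numeric columns
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 1.9, 3.2, 2.8, 4.1, 3.9)

result <- cor.test(x, y, method = "pearson")

# The estimate is stored as a named numeric; unname() drops the "cor" label
r <- unname(result$estimate)
r_rounded <- round(r, 3)

# Rounding the coefficient printed above for age and friend_count
age_fc_rounded <- round(-0.02740737, 3)

print(r_rounded)
print(age_fc_rounded)
```

The same object also carries the p-value (result$p.value) and the confidence interval (result$conf.int).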


Correlation on Subsets

Notes:

with(subset(pf,age <= 70) , cor.test(age, friend_count,
                                        method = 'spearman'))
## Warning in cor.test.default(age, friend_count, method = "spearman"): Cannot
## compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  age and friend_count
## S = 1.5782e+14, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.2552934

Correlation Methods

Notes:
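The calls above use both Pearson and Spearman coefficients. As a small sketch of how they differ, the base R example below uses an invented, strictly increasing but curved relationship: Spearman's rho is Pearson's r computed on ranks, so it measures monotonic rather than linear association.

```r
# Invented data: y always increases with x, but not along a straight line
x <- 1:20
y <- exp(x / 3)

pearson  <- cor(x, y, method = "pearson")   # penalised by the curvature
spearman <- cor(x, y, method = "spearman")  # perfect monotonic relationship

# Spearman's rho is the Pearson correlation of the ranks
rank_based <- cor(rank(x), rank(y))

print(c(pearson = pearson, spearman = spearman, rank_based = rank_based))
```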


Create Scatterplots

Notes:

ggplot(aes(x = www_likes_received,y = likes_received),data = pf) +
  geom_point() +
  xlim(0,quantile(pf$www_likes_received,0.95)) +
  ylim(0,quantile(pf$likes_received,0.95)) +
  geom_smooth(method = 'lm', color = 'red')
## Warning: Removed 6075 rows containing non-finite values (stat_smooth).
## Warning: Removed 6075 rows containing missing values (geom_point).


Strong Correlations

Notes:

cor.test(pf$www_likes_received,pf$likes_received)
## 
##  Pearson's product-moment correlation
## 
## data:  pf$www_likes_received and pf$likes_received
## t = 937.1, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9473553 0.9486176
## sample estimates:
##       cor 
## 0.9479902

What’s the correlation between the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.

Response: 0.948


Moira on Correlation

Notes:


More Caution with Correlation

Notes:

library(alr3)
## Loading required package: car
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
data("Mitchell")
?Mitchell
head(Mitchell)
##   Month     Temp
## 1     0 -5.18333
## 2     1 -1.65000
## 3     2  2.49444
## 4     3 10.40000
## 5     4 14.99440
## 6     5 21.71670

Create your plot!

ggplot(aes(x=Month,y=Temp),data = Mitchell) + 
  geom_point()


Noisy Scatterplots

  1. Take a guess for the correlation coefficient for the scatterplot.

  2. What is the actual correlation of the two variables? (Round to the thousandths place)

cor.test(Mitchell$Month,Mitchell$Temp)
## 
##  Pearson's product-moment correlation
## 
## data:  Mitchell$Month and Mitchell$Temp
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08053637  0.19331562
## sample estimates:
##        cor 
## 0.05747063

Making Sense of Data

Notes:

ggplot(aes(x=Month,y=Temp),data = Mitchell) + 
  geom_point() +
  scale_x_continuous(breaks = seq(0,203,12))


A New Perspective

What do you notice?

Response: The temperature follows a sinusoidal pattern over each year.

Watch the solution video and check out the Instructor Notes!

Notes:
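To see how a strong yearly cycle can coexist with a near-zero linear correlation, the base R sketch below builds a noise-free synthetic temperature series. The amplitude, offset, and phase are assumptions chosen for illustration; this is not the Mitchell data.

```r
# Synthetic series: 204 months of a pure yearly sine wave (assumed shape)
month <- 0:203
temp  <- 10 + 15 * sin(2 * pi * (month - 3) / 12)

# Linear correlation across the whole series is close to zero
r <- cor(month, temp)

# Folding the months onto a single year exposes the seasonal pattern
month_of_year  <- month %% 12
seasonal_means <- tapply(temp, month_of_year, mean)

print(r)
print(round(seasonal_means, 1))
```

Plotting temp against month_of_year instead of month would overlay all years on one cycle, which is the same idea as setting axis breaks every 12 months.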


Understanding Noise: Age to Age Months

Notes:

ggplot(aes(x=age,y=friend_count_mean),data=pf.fc_by_age) +
  geom_line()


Age with Months Means

pf$age_with_months <- pf$age + (1 - pf$dob_month/12)
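A worked example may make the formula easier to check by hand. The sketch below applies it to three hypothetical users who share the same age in years but have different birth months; the plain vectors age and dob_month stand in for the pf columns.

```r
# Hypothetical users: same age in years, different birth months
age       <- c(36, 36, 36)
dob_month <- c(1, 6, 12)

# Same formula as above, applied to plain vectors instead of pf columns
age_with_months <- age + (1 - dob_month / 12)

# January adds 11/12 of a year, June adds 6/12, December adds nothing
print(age_with_months)
```

Under this formula, earlier birth months add a larger fraction of a year, which reads as measuring age at the end of the year; that interpretation is an assumption rather than something stated in the lesson.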

Programming Assignment

age_bymonth_groups <- group_by(pf,age_with_months)
pf.fc_by_agemonth <- summarise(age_bymonth_groups,
                          friend_count_mean = mean(friend_count),
                          friend_count_median = median(friend_count),
                          n = n())

pf.fc_by_agemonth <- arrange(pf.fc_by_agemonth,age_with_months)
head(pf.fc_by_agemonth)
## # A tibble: 6 x 4
##   age_with_months friend_count_mean friend_count_median     n
##             <dbl>             <dbl>               <dbl> <int>
## 1            13.2              46.3                30.5     6
## 2            13.2             115.                 23.5    14
## 3            13.3             136.                 44      25
## 4            13.4             164.                 72      33
## 5            13.5             131.                 66      45
## 6            13.6             157.                 64      54

Noise in Conditional Means

ggplot(aes(x=age_with_months,y=friend_count_mean),data = pf.fc_by_agemonth) + 
  geom_line() +
  coord_cartesian(xlim = c(13,70),ylim = c(0,450))


Smoothing Conditional Means

Notes:

p1 <- ggplot(aes(x=age,y=friend_count_mean),
       data=subset(pf.fc_by_age,age<71)) + 
  geom_line()

p2 <- ggplot(aes(x=age_with_months,y=friend_count_mean),
       data = subset(pf.fc_by_agemonth,age_with_months<71)) + 
  geom_line()

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
grid.arrange(p2,p1,ncol = 1)


Which Plot to Choose?

Notes:


Analyzing Two Variables

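Reflection: By shrinking the bin width and increasing the number of bins, we reduce the amount of data used to estimate each conditional mean. The noisier plot is a consequence of choosing finer bins.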
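The trade-off between bin width and noise can be sketched with a small simulation. The data below are synthetic: friend counts are drawn with the same true mean at every age, so any variation in the conditional means is pure sampling noise. The seed, sample size, and distribution parameters are arbitrary assumptions.

```r
set.seed(42)                             # arbitrary seed for reproducibility
n   <- 12000
age <- runif(n, min = 13, max = 73)      # synthetic ages
fc  <- rnorm(n, mean = 200, sd = 150)    # synthetic counts with a flat true mean

coarse_bin <- floor(age)                 # 1-year bins, about 200 users each
fine_bin   <- floor(age * 12) / 12       # 1-month bins, about 17 users each

coarse_means <- tapply(fc, coarse_bin, mean)
fine_means   <- tapply(fc, fine_bin, mean)

# Spread of the conditional means around the flat true mean
sd_coarse <- sd(coarse_means)
sd_fine   <- sd(fine_means)

print(c(coarse = sd_coarse, fine = sd_fine))
```

The monthly means scatter more widely than the yearly means because each one is estimated from fewer observations.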

Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!